Abstract

Motor  actions  in  speech  production  are  both  rapid  and  highly  dexterous,  even  though 
speed  and  accuracy  are  often  thought  to  conflict.  Fitts’  law  has  served  as  a  rigorous 
formulation  of  the  fundamental  speed-accuracy  tradeoff  in  other  domains  of  human  motor 
action,  but  has  not  been  directly  examined  in  the  domain  of  speech  production.  The 
present  work  seeks  evidence  for  Fitts’  law  in  speech  articulation  kinematics  by  analyzing 
USC-TIMIT,  a  large  database  of  real-time  magnetic  resonance  imaging  data  of  speech 
production.  A  theoretical  framework  for  considering  Fitts’  law  in  the  domain  of  speech 
production is elucidated. Methodological challenges in applying Fitts-style analysis are
addressed,  including  the  definition  and  operational  measurement  of  key  variables  in 
real-time  MRI  data.  Results  suggest  the  presence  of  clear  tradeoffs  between  speed  and 
accuracy  for  certain  types  of  speech  production  actions,  with  wide  variability  across 
syllabic  position,  and  substantial  variability  also  across  subjects.  Coda  consonant  targets 
immediately  following  the  syllabic  nucleus  show  the  strongest  evidence  of  this  tradeoff, 
with  correlations  as  high  as  0.72  between  speed  and  accuracy.  Results  are  discussed  with 
respect  to  potential  limitations  of  Fitts’  law  in  the  context  of  speech  production,  as  well  as 
the theoretical context. Future improvements in application of Fitts’ law are discussed.

Speed-Accuracy Tradeoffs in Human Speech Production

Introduction 

The present work applies certain influential ideas of Paul Fitts (1954) in the domain
of speech production, specifically his formulation of so-called speed-accuracy tradeoffs in
human motor action. Fitts was primarily concerned with quantifying the capacity of the
human motor system to perform motor actions. One important outcome of that work was
a rigorous formulation of perhaps one of the most robust and widely replicated laws of human
motor  action:  for  discrete,  targeted  actions,  the  time  taken  to  complete  a  movement 
displays  a  linear  relationship  with  task  difficulty,  where  difficulty  is  a  function  of  movement 
distance  and  the  tolerable  error  in  reaching  the  target.  This  law  is  typically  described  as  a 
speed-accuracy  relationship,  given  the  intuitive  notion  that  movement  time  and  speed  are 
quantities  that  are  closely,  if  inversely,  related,  and  that  tolerable  error  is  the  reciprocal  of 
accuracy. This now well-known relationship has subsequently been referred to as Fitts’ law,
and has been used widely to model speed-accuracy tradeoffs in a variety of human
movement domains. Example application domains include manual pointing and reaching
(as in Fitts’ original study), targeted foot movements Drury (1975), balance and posture
Duarte & Freitas (2005), and computer device interaction Card et al. (1978). Fitts’ law has
also  been  applied  to  ballistic  movements,  including  eye  saccades  Ware  &  Mikaelian  (1987), 
although  there  is  meaningful  debate  over  whether  movements  that  do  not  rely  heavily  on 
feedback  are  subject  to  the  same  law  Carpenter  (1988);  Sibert  &  Jacob  (2000);  Drewes 
(2013). 

It  is  not  well-established  whether  this  pervasive  law  of  human  movement  is  obeyed  by 
speech  motor  actions.  Despite  evidence  that  speech  articulation  obeys  related  tradeoffs 
among metrics of speed, distance and curvature Löfqvist & Gracco (1997); Perrier & Fuchs
(2008);  Kato  et  al.  (2009),  Fitts’  law  has  not  been  directly  examined  in  the  context  of 
speech  production.  Motor  actions  associated  with  speech  production  are  some  of  the  most 
rapid and dexterous that humans execute. The presence of speed-accuracy tradeoffs would
imply, however, that it is not necessarily possible to attain high levels of speed and
accuracy  at  the  same  time.  Moreover,  in  speech  production,  there  are  potentially  multiple 
domains  in  which  accuracy  may  be  demanded,  ranging  from  articulatory  and  acoustic,  to 
prosodic  and  communicative,  with  all  of  these  demands  being  possibly  simultaneous  and 
overlapping.  Kinematics  are  the  present  focus  because  many  human  motor  actions  exhibit 
a  clear  kinematic  tradeoff  between  speed  and  accuracy.  The  present  paper  examines  one 
aspect  of  accuracy  in  speech  actions:  the  kinematics  of  “reaching”  for  maximal  articulatory 
targets. Articulatory speech actions can be conceptualized as discrete motor actions.

Speed-accuracy  tradeoffs  can  provide  a  window  into  the  control  mechanisms  of  directed 
movements.  While  it  is  possible  that  biomechanical  constraints  exist  that  give  rise  to  such 
tradeoffs,  there  is  also  good  reason  to  believe  that  they  are  the  result  of  properties  of 
planning  and  control.  It  can  be  shown  that  Fitts’  law  is  consistent  with  traditional  models 
of  feedback-driven  motor  control  Langolf  et  al.  (1976).  Moreover,  it  is  closely  related  to 
models  of  neural  dynamics  of  movement  trajectory  formation  Bullock  &  Grossberg  (1988). 
If  it  is  true  that  control  mechanisms  bring  about  speed-accuracy  tradeoffs,  it  implies  that 
changes  in  timing  can  be  used  to  assess  demands  in  accuracy  and,  conversely,  that  changes 
in  accuracy  can  be  partially  attributed  to  speaking  rate  demand. 

The  presence  of  Fitts-type  tradeoffs  in  speech  production  would  help  to  explain  a  variety  of 
observed  phenomena.  It  has  been  argued  that  speech  motor  actions  vary  considerably  in 
difficulty, and that differences in difficulty relate to elements of timing. Hardcastle
(1976) asserted that the difficulty (or complexity, to use his terminology) of an
articulatory  action  should  be  defined  in  terms  of  both  the  number  of  articulatory  variables 
that  are  recruited  over  the  course  of  that  action,  and  in  terms  of  the  precision  required  for 
each  of  those  variables.  The  issue  of  articulatory  precision  and  its  kinematic  consequences 
is  entirely  compatible  with  Fitts’  law.  Hardcastle  goes  on  to  make  direct  reference  to  a 
speed-accuracy  tradeoff  in  speech  production,  while  arguing  that  fricatives  require  more 
precision than stop consonants: “One of the possible effects of this greater precision is that
the articulators involved in the production of a fricative might move more slowly than for
the production of a stop.” Hardcastle notes that this may help to explain why vowels are
often lengthened in advance of fricatives (i.e., more time is required to execute the more
difficult fricative articulation), as originally suggested by MacNeilage (1972),
and why lower vowels are longer than higher vowels Lehiste (1970) (i.e., more time is required
for the tongue to travel the longer distance). This is also a possible explanation for the
observation that fricatives have longer durations, in general, than stops Kuwabara (1996).

Speed-accuracy  tradeoffs  may  also  aid  in  explaining  changes  in  speed  and  accuracy  during 
speech  acquisition  and  loss.  Speaking  rate  decline  is  associated  with  various  kinds  of 
neurological decline Yunusova et al. (2008); Williamson et al. (2015). Changes in speaking
rate are perhaps a compensatory mechanism for maintaining accuracy when difficulty
increases. Even in normal speakers, accuracy and intelligibility decline at markedly
increased speaking rates Kleinow et al. (2001); Krause & Braida (2002). The notion of
articulatory  difficulty  may  also  help  to  explain  why  fricatives  tend  to  be  acquired  later 
than  stops  Templin  (1957),  and  why  some  productions  are  more  quickly  impacted  when  the 
condition  of  the  motor  system  changes,  as  in  the  idea  that  sleepiness  and  alcohol 
intoxication lead to the salient changes in fricatives associated with “slurred speech” Chin
& Pisoni (1997); Schuller et al. (2014). Better knowledge of the presence and nature of
speed-accuracy  tradeoffs  in  speech,  therefore,  would  have  natural  applications  toward 
identifying  those  elements  of  speech  production  that  are  early  indicators  of  neurological 
change  or  decline. 

The  purpose  of  this  paper  is  three-fold.  The  primary  goal  is  to  analyze  speech  articulation 
using a large database of real-time magnetic resonance imaging (rtMRI) data, in order to assess
whether  articulatory  kinematics  conform  to  Fitts’  law.  A  second,  associated  goal  is  to 
address  the  methodological  challenges  inherent  in  performing  Fitts-style  analysis  on  rtMRI 
data  of  speech  production.  Methodological  challenges  include  segmenting  continuous 
speech into specific motor tasks, defining key variables of Fitts’ law in the domain of speech
articulation, and deciding how to operationalize these definitions and extract related
measures  from  complex  and  high-dimensional  rtMRI  data.  Finally,  a  third  goal  is  to 
present  a  novel  mathematical  argument  for  Fitts’  law  in  speech  production,  and  make  a 
theoretical  argument  for  why  one  would  expect  to  observe  behavior  consistent  with  the 
law.  Section  2  gives  a  brief  introduction  to  the  concepts  and  mathematics  behind  Fitts’ 
law,  and  presents  an  argument  for  Fitts’  law  in  speech  production.  Section  3  describes  the 
data  used  in  the  present  study,  and  the  necessary  pre-processing  for  the  task  being 
considered.  Section  4  explains  the  present  approach  to  applying  Fitts’  law  in  the  domain  of 
speech  production  data.  The  results  of  applying  the  proposed  methodology  to  rtMRI  data, 
and  a  discussion  of  the  results  in  terms  of  the  goals  of  the  paper,  are  given  in  Section  5. 
Lammert et al. (2016) have previously reported on an initial effort to meet
some of the goals of the present work by analyzing portions of the USC-TIMIT database
and forming necessary elements of the data analysis. This paper constitutes a substantial
expansion of that work, providing a more extensive and deeper analysis on more subjects,
as well as a better developed framework for considering speed-accuracy tradeoffs in a
speech  production  context.  In  particular,  the  present  work  provides  (1)  an  analysis  of  six 
additional  subjects,  altogether  comprising  the  entirety  of  the  real-time  data  from  the 
USC-TIMIT  database,  (2)  a  more  detailed  look  at  speed-accuracy  relationships  in  speech 
tasks  of  different  varieties,  specifically  tasks  situated  in  different  parts  of  the  syllable,  and 
(3)  elucidation  of  a  theoretical  framework  for  considering  Fitts’  law  in  the  domain  of 
speech  production,  and  its  mathematical  connection  to  prominent  models  of  speech  motor 
control  and  neural  control  of  movement. 

Background 

Fitts’  law  can  be  stated  precisely  in  mathematical  terms.  It  has  deep  connections  with 
several  prominent  frameworks  of  directed  human  motor  control.  This  section  is  intended  to 
provide an overview of Fitts’ law, including the mathematical statement thereof, as well as
connections to the Task Dynamics control framework Saltzman & Kelso (1987); Saltzman
&  Munhall  (1989),  and  the  VITE  model  of  neural  control  of  directed  human  movement 
Bullock  &  Grossberg  (1988). 

Statement  of  Fitts’  Law 

Given a target associated with a given task, as well as an initial position (also called the context), key
parameters  of  that  action  can  be  defined,  and  incorporated  into  a  simple  framework  that 
represents  the  difficulty  associated  with  that  task.  One  parameter  is  the  distance  to  the 
target  from  the  initial  position.  Longer  distances  are  assumed  to  make  a  task  more 
difficult.  The  other  parameter  is  the  width  of  the  target,  which  represents  the  tolerable 
error  in  reaching  the  target.  A  wider  target  is  assumed  to  make  a  task  less  difficult, 
perhaps  corresponding  to  more  slack  being  permitted  in  declaring  an  action  successful. 

The ratio of the distance to the target, $D$, and its width, $W$, is then associated with the
index of difficulty (ID) in the following way:

$$ID = \log_2\left(\frac{2D}{W}\right) \tag{1}$$

The  reciprocal  ratio  W/D  constitutes  one  definition  of  the  accuracy  of  a  task.  Taking  the 
base-2 logarithm of this accuracy measure, then, gives the ID units that can be
interpreted  as  bits,  inspired  by  Claude  Shannon’s  information  theory  Shannon  &  Weaver 
(1949).  The  ID,  having  encapsulated  a  notion  of  accuracy  of  action,  should  then  be  related 
to  the  movement  time  (MT)  associated  with  a  given  task,  under  the  hypothesis  that  a 
tradeoff  exists  between  speed  and  accuracy  of  that  task.  This  relationship,  Fitts’  law,  is 
commonly  formulated  as  a  simple,  linear  one: 

$$MT = a \cdot ID + b, \tag{2}$$

where $a$ and $b$ are constants, the values of which depend on the task and characteristics of
control. Fitts’ law has been derived in various ways since the original formulation
Crossman & Goodeve (1983); Bullock & Grossberg (1988); Beamish et al. (2006).
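
As an illustration, Equations 1 and 2 can be sketched in a few lines of Python. This is a minimal sketch rather than the analysis pipeline used in this work: it assumes the conventional form ID = log2(2D/W), and the function names, task values and coefficients (a = 0.1, b = 0.05) are hypothetical, chosen only to show how a and b would be recovered by a least-squares fit.

```python
import math


def index_of_difficulty(distance, width):
    """Index of difficulty in bits, assuming the form ID = log2(2D / W)."""
    if distance <= 0 or width <= 0:
        raise ValueError("distance and width must be positive")
    return math.log2(2.0 * distance / width)


def fit_fitts_law(ids, mts):
    """Ordinary least-squares fit of MT = a * ID + b; returns (a, b)."""
    n = len(ids)
    mean_id = sum(ids) / n
    mean_mt = sum(mts) / n
    sxx = sum((x - mean_id) ** 2 for x in ids)
    sxy = sum((x - mean_id) * (y - mean_mt) for x, y in zip(ids, mts))
    a = sxy / sxx
    b = mean_mt - a * mean_id
    return a, b


# Hypothetical (distance, width) pairs in arbitrary units.
tasks = [(8.0, 4.0), (8.0, 2.0), (16.0, 2.0), (16.0, 1.0)]
ids = [index_of_difficulty(d, w) for d, w in tasks]

# Synthetic movement times generated from assumed coefficients a=0.1, b=0.05.
mts = [0.1 * i + 0.05 for i in ids]

a, b = fit_fitts_law(ids, mts)
```

With real data, the movement times would be measured rather than generated, and the quality of the linear fit (e.g., the correlation between ID and MT) would indicate how well the law holds.
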

Note that, whereas the distance associated with a task is typically fairly straightforward to
define  given  an  initial  position  and  a  target  (e.g.,  the  Euclidean  distance),  the  width 
parameter  has  been  defined  in  many  different  ways.  Fitts’  original  experiments  included 
targets  with  a  literal,  physical  width  of  varying  size,  but  many  experimental  setups  have 
only  a  point  target  (as  assumed  in  many  human  actions).  In  the  domain  of  speech 
production,  however,  one  is  faced  with  an  added  complication  stemming  from  a  lack  of 
consensus  regarding  how  an  articulatory  target  should  be  defined,  or  indeed  whether  an 
articulatory  target  (as  opposed  to  acoustic)  exists  at  all.  In  the  present  work,  it  is  assumed 
that  articulatory  targets  do  exist,  following  the  specific  definition  explained  below. 


It is worth noting certain subtleties with regard to the interpretation of Fitts’ law as an
expression of a speed-accuracy tradeoff. Much of the literature related to Fitts’ law
interprets the law as such a tradeoff, either implicitly or explicitly. For example, Fitts &
Radford (1966) discuss the variables MT and $W$ as representing speed and
the reciprocal of accuracy, respectively. The law, under this interpretation of the variables,
is therefore an expression of a speed-accuracy tradeoff, with the additional caveat that
accuracy must always be considered relative to $D$. This interpretation of Fitts’ law assumes
that “speed” is the reciprocal of MT (essentially an expression of the speed of completion
of the task) rather than articulator speed, as in the classical-mechanical sense of $|D/MT|$.
A classical definition of the speed-accuracy tradeoff might be $W_1 = cD/MT$, stating that
$W_1$ is proportional to articulatory velocity, given some coefficient $c$. This classical
definition is not exactly the same as Fitts’ law, but the two can be related by rewriting
Equation 2 as $MT = \log_2(cD) - \log_2(W_2)$, implying that $W_2 = cD/2^{MT}$ (ignoring
coefficients, for simplicity, and substituting $c$ for the value 2). The quantity $cD/2^{MT}$ is still
not the classical definition of speed, but it similarly decreases monotonically with MT, and
the quantities $W_1$ and $W_2$ from the classical and Fitts’ definitions, respectively, can be related by a
multiplicative factor, $W_1 = \eta W_2$, where $\eta = 2^{MT}/MT$.
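
The relationship between the two definitions of tolerable error can be checked numerically. The following sketch assumes the classical form W1 = cD/MT and the simplified Fitts form W2 = cD/2^MT; the function names and the values of c, D and MT are hypothetical and in arbitrary units.

```python
def classical_width(c, distance, mt):
    """Tolerable error under the classical reading: W1 = c * D / MT."""
    return c * distance / mt


def fitts_width(c, distance, mt):
    """Tolerable error implied by the simplified Fitts form: W2 = c * D / 2**MT."""
    return c * distance / (2.0 ** mt)


def eta(mt):
    """Multiplicative factor relating the two definitions: W1 = eta * W2."""
    return (2.0 ** mt) / mt


# Hypothetical coefficient, distance and movement times (arbitrary units).
c, distance = 2.0, 10.0
movement_times = [0.5, 1.0, 2.0, 4.0]

w1 = [classical_width(c, distance, mt) for mt in movement_times]
w2 = [fitts_width(c, distance, mt) for mt in movement_times]
ratios = [first / second for first, second in zip(w1, w2)]
```

Both quantities shrink as MT grows, which is the sense in which either can be read as a speed-accuracy tradeoff, even though they differ by the MT-dependent factor.
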

Theoretical Framework

Fitts’  law  has  substantial  mathematical  connections  with  the  dynamical  systems  view  of 
coordination  and  control  of  human  movement  (e.g.,  Turvey  (1990);  Davids  et  al.  (2003)). 
This  section  attempts  to  elucidate  those  connections,  and  to  provide  a  novel  argument  for 
expecting  behavior  consistent  with  Fitts’  law  in  speech  production  on  the  basis  of 
prominent  theories  of  speech  motor  control  and  neural  dynamics.  Within  the  dynamical 
systems  perspective,  one  representative  body  of  work  that  has  had  an  impact  on  modeling 
and  explaining  speech  articulation  is  that  of  Task  Dynamics  Saltzman  &  Kelso  (1987); 
Saltzman  &  Munhall  (1989).  Task  Dynamics  constitutes  a  control  system  that  allows  for 
the description and achievement of directed actions in a relatively high-level task space, as
opposed to the relatively low-level articulator space, defined by variables of mobility such
as muscle activations. An example of a task space for a manual reaching task would be
three-dimensional Cartesian space, as opposed to the articulatory space of joint angles at
the shoulder, elbow and wrist. Task space for a speech production action could be the
space defined by the first three formant frequencies, or the space defined by vocal tract
constriction degree and location, as in Articulatory Phonology Browman & Goldstein
(1992).  These  high-level  spaces  are  the  natural  spaces  in  which  to  define  the  goals  of 
directed  action,  and  Task  Dynamics  defines  a  rigorous  framework  in  which  motor 
commands  can  be  generated  in  articulator  space  toward  the  completion  of  movements  in 
task  space. 

In  Task  Dynamics,  the  targets  of  directed  movement  are  assumed  to  be  points  in  task 
space. Those targets are achieved by point-attractor dynamics, governed by second-order
equations of motion consistent with a critically damped harmonic oscillator, the dynamics
of which are well understood from classical mechanics. For the sake of simplicity, consider
a one-dimensional task space. The equations can be written as follows:

$$\ddot{X} = \frac{-c\dot{X}}{m} - \frac{k(X - X_0)}{m} \tag{3}$$

where $X$ is the displacement of the controlled variable, $X_0$ is the target, and $m$, $c$ and $k$
are the mass, damping and stiffness parameters, respectively. The forward
dynamics take the form of a second-order dynamical system, conforming to Equation 3,
that transforms the error signal, $\Delta X$, into the second derivative of the articulator-space
variable $u$. An overview of the control flow in Task Dynamics is shown in Figure 1.
Equation 3 is contained within the box labelled “Forward Dynamics”, which computes the
acceleration of $u$ from $\Delta X = X_0 - X$. Note that the low-level articulator variables, $u$, and
the  relevant  kinematic  transformations  between  task  and  articulator  spaces  are  not 
discussed  in  the  present  context.  This  is  because  dynamics  in  task  space  only  are  sufficient 
to  account  for  Fitts’  law. 

Fitts’  law  can  be  seen  as  a  direct  consequence  of  such  dynamics.  A  mathematical 
connection  can  be  made  through  an  examination  of  the  step  response  of  the  system,  which 
corresponds  to  the  sudden  appearance  of  a  new  target  in  task  space.  The  relevant  quantity 
then  becomes  the  settling  time  of  the  damped  harmonic  oscillator,  that  is,  the  time 
required  for  the  system  to  converge  within  a  certain  percentage  of  the  final  target  value, 
beginning  at  rest.  It  is  well  known  from  classical  mechanics  that,  in  the  case  of  critical 
damping,  the  rate  of  convergence  in  the  step  response  to  a  change  in  target  follows  a 
decaying exponential. That is, the displacement of the system at time $t$ is $X_t = X_0 e^{-\omega_0 \zeta t}$,
in a system where the natural frequency is $\omega_0 = \sqrt{k/m}$, and the damping ratio is
$\zeta = c/(2m\omega_0)$. In the case of critical damping, $\zeta = 1$, and $X_t = X_0 e^{-\sqrt{k/m}\,t}$.

Several  of  these  quantities  can  be  related  directly  to  those  in  the  formulation  of  Fitts’  law. 
The  value  t  can  be  considered  as  MT,  the  time  at  which  the  system  is  considered  to  have 
settled,  or  completed  its  action.  Given  that  the  movement  takes  time  t  to  complete,  and  Xt 
is  the  residual  displacement  of  the  controlled  variable  after  the  action  has  completed,  Xt 
can be equated with the error tolerance $W$. Furthermore, $X_0$ is equivalent to the
movement distance, $D$, if the movement is considered to begin at $X = 0$. Following from
these identities, we can express the step response equation above with a change of
variables, as $W = D e^{-\sqrt{k/m}\,MT}$. This can be easily rewritten as:

$$MT = \frac{1}{\sqrt{k/m}} \ln\left(\frac{D}{W}\right) \tag{4}$$

which is already similar to Fitts’ law in form. We can find the conditions under which they
are equivalent by setting this new formula for MT equal to the one taken from Fitts’ law.
Beginning, for the sake of clarity, with a change of logarithm base from the law expressed
in Equation 2 (corresponding to a switch of units in ID from bits to nats), we have:

$$a \ln\left(\frac{2D}{W}\right) + b = \frac{1}{\sqrt{k/m}} \ln\left(\frac{D}{W}\right) \tag{5}$$

It is easy to show that this equation holds for certain values of $a$ and $b$. For instance,
assuming that $a = \frac{1}{\sqrt{k/m}}$ (the reciprocal of the natural frequency of oscillation), one can
solve to find that $b = -\frac{\ln 2}{\sqrt{k/m}}$. Therefore, Fitts’ law conforms to the predicted kinematic
behavior of a damped harmonic oscillator, which is consistent with the behavior of a Task
Dynamic control system when acting to achieve a specific movement target.
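
The equivalence argued above can be checked numerically. The sketch below assumes the exponential step response W = D·exp(−sqrt(k/m)·MT) and an index of difficulty in nats of the form ln(2D/W); under those assumptions, equating the two expressions for MT gives a = 1/sqrt(k/m) and b = −ln(2)/sqrt(k/m). The stiffness, mass, distances and widths are hypothetical values in arbitrary units.

```python
import math


def settling_time(distance, width, k, m):
    """Time for an exponentially decaying error to shrink from D to W."""
    omega0 = math.sqrt(k / m)  # natural frequency
    return math.log(distance / width) / omega0


def fitts_time_nats(distance, width, a, b):
    """Fitts-style movement time with ID measured in nats: a * ln(2D/W) + b."""
    return a * math.log(2.0 * distance / width) + b


# Hypothetical stiffness and mass (arbitrary units).
k, m = 100.0, 1.0
omega0 = math.sqrt(k / m)

# Coefficients obtained by equating the two expressions for MT.
a = 1.0 / omega0
b = -math.log(2.0) / omega0

# Hypothetical (distance, width) pairs.
pairs = [(8.0, 4.0), (8.0, 1.0), (16.0, 2.0), (32.0, 0.5)]
settle = [settling_time(d, w, k, m) for d, w in pairs]
fitts = [fitts_time_nats(d, w, a, b) for d, w in pairs]

# Residual error after each settling time; should equal the width W.
residuals = [d * math.exp(-omega0 * t) for (d, _), t in zip(pairs, settle)]
```

Because both expressions are linear in ln(D/W), the agreement holds for any positive D and W, not only the pairs listed here.
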

In addition to the kinematic considerations of the Task Dynamics model, Fitts’ law also
has substantial mathematical connections with models of the neural dynamics underlying
the dynamical systems view of human motor control. An influential neural-inspired
network model for explaining kinematic trajectory formation of directed movement is the
VITE model Bullock & Grossberg (1988). This model’s predictions are highly consistent
with those of the Task Dynamics model, owing to the fact that VITE is a second-order
dynamical system much like Task Dynamics (as pointed out by, e.g., Beamish et al.
(2006)). VITE comprises a network of interacting hypothesized neural populations which
generate a movement command, given some target position. The neural populations are
configured in order to code distinct quantities that are needed in the generation of the
motor command. Among the interacting neural populations, there is (a) a population
representing the target position command (TPC), (b) a population representing the
present position command (PPC), and (c) a population referred to as the difference vector
(DV) population, which represents the difference between the PPC and TPC.

The specific structure of VITE’s interacting network is shown in Figure 2. Note the many
similarities  of  this  structure  to  that  of  Task  Dynamics  in  Figure  1.  TPC,  as  a 
representation  of  the  target  position,  produces  a  target  position  X0.  The  DV  population 
compares  the  target  to  the  system’s  current  position,  and  computes  the  task-space 
dynamics  of  the  network.  The  PPC  population,  meanwhile,  integrates  the  DV  population 
activation  into  position  information,  in  analogy  to  the  physical  plant  in  the  Task  Dynamics 
control  flow.  The  network  dynamics  have  the  following  form: 

$$\dot{V} = \alpha(X_0 - X - V), \tag{6}$$

and

$$\dot{X} = GV \tag{7}$$

where the parameter $\alpha$ has been termed the “convergence coefficient” and $G$ is the “go”
signal, which initiates and sustains movement. These equations also compare easily to the
equations of motion for Task Dynamics given above in Equation 3. There are important
differences, however (see footnote 2). First, all computations are done at the level of tasks, with no
mention  of  the  articulator  space.  Therefore,  there  is  no  need  for  kinematic  transformations 
between  task  space  and  articulator  space  in  VITE.  Second,  the  inclusion  of  G  has  no 
equivalent  in  Task  Dynamics,  where  it  is  assumed  (implicitly)  that  movement  toward  a 
target  is  always  active  as  long  as  the  target  exists. 

As  with  Task  Dynamics,  Fitts’  law  can  be  seen  as  a  direct  consequence  of  these 
neural-inspired  dynamics.  This  can  be  shown  by  demonstrating  the  mathematical 
relationship between the equations of motion in Equation 6 and Equation 3. If $G = 1$, then
$V = \dot{X}$, and subsequently Equations 6 and 7 collapse into the single equation:

$$\ddot{X} = \alpha(X_0 - X - \dot{X}), \tag{8}$$

which is the same as Equation 3, if $\alpha = k/m = c/m$. Therefore, VITE is consistent
with Task Dynamics control, and Fitts’ law can be seen as related to both those models in
a general sense, and as a direct consequence of them under the specified conditions and
parameters. Note, incidentally, that because the damping coefficient in VITE is fixed at
$c = \alpha m$, in order for the system to be critically damped (i.e., $c = 2\sqrt{mk}$), as in Task
Dynamics, it must be that $m = k/4$. Overdamping will occur with $m < k/4$, and underdamping with
$m > k/4$.

Footnote 2: This presentation also glosses over a nonlinearity in the original VITE formulation, where $V$ is not
allowed to go negative. This detail was not seen as important in the present context.
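
The correspondence between the VITE equations and a damped oscillator can be illustrated by direct simulation. The following is a minimal sketch, not the original VITE implementation: it integrates the two network equations with a simple forward-Euler scheme, assumes G = 1 and a positive convergence coefficient, and omits the nonlinearity mentioned in the footnote. With α = 4 the stiffness-to-mass and damping-to-mass ratios both have magnitude 4, which corresponds to critical damping with natural frequency 2. The step size, duration and target are hypothetical choices.

```python
import math


def simulate_vite(alpha, target, go=1.0, dt=1e-4, duration=5.0):
    """Forward-Euler integration of dV/dt = alpha*(target - X - V), dX/dt = go*V.

    Starts at rest (X = 0, V = 0) and returns the list of positions X.
    """
    x, v = 0.0, 0.0
    trajectory = [x]
    steps = int(round(duration / dt))
    for _ in range(steps):
        dv = alpha * (target - x - v)
        dx = go * v
        v += dt * dv
        x += dt * dx
        trajectory.append(x)
    return trajectory


def critically_damped_step(target, omega0, t):
    """Closed-form step response of a critically damped oscillator from rest."""
    return target * (1.0 - (1.0 + omega0 * t) * math.exp(-omega0 * t))


# Hypothetical settings: alpha = 4 gives critical damping with omega0 = 2.
alpha, target, dt = 4.0, 1.0, 1e-4
positions = simulate_vite(alpha, target, dt=dt)
omega0 = math.sqrt(alpha)

# Largest deviation between the simulated and closed-form trajectories.
max_error = max(
    abs(x - critically_damped_step(target, omega0, i * dt))
    for i, x in enumerate(positions)
)
```

Choosing a larger or smaller α in this sketch moves the simulated system away from critical damping, which can be seen as slower convergence or overshoot of the target, respectively.
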


Methodological  Framework 

To  apply  Fitts-style  analysis  to  speech  production  data,  it  is  necessary  to  operationally 
define  the  targets  of  articulation  in  space  and  time.  To  that  end,  it  is  assumed  that  a  single 
articulatory  target  is  associated  with  each  phoneme.  Targets  might  not  be  reached  during 
continuous  speech  for  a  variety  of  reasons,  including  undershoot,  misarticulation,  or 
tolerance  of  the  controller  to  some  deviation  from  the  target.  However,  it  is  assumed  that 
the  action  associated  with  a  given  phone  comes  closest  to  achieving  its  target  at  the 
temporal  center  of  the  associated  phone  interval.  Thus,  each  targeted  task  in  continuous 
speech  can  be  conceptualized  as  movement  from  one  phoneme  target  to  another, 
constituting  a  specific  diphone.  Tasks  conceptualized  this  way  can  also  be  referred  to  by 
diphone,  which  represents  a  context-target  task  pair.  It  is  further  assumed  that  the  target 
of  a  given  phoneme  is  a  vector  in  high-dimensional  articulatory  space.  The  location  of  that 
vector  is  estimated  as  the  mean  of  all  tokens  with  a  given  phoneme  label.  The  initial 
position  for  a  given  task  is  assumed  to  be  the  target  immediately  preceding  the  current 
one.  All  these  notions  will  be  defined  formally  below  in  Figure  3. 
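
The target and initial-position assumptions above can be sketched as follows. This is an illustrative sketch only: it assumes each token is a feature vector sampled at the temporal center of a phone interval, takes the per-phoneme mean as the target, and uses Euclidean distance between the context target and the current target as one possible distance measure. The function names, phoneme labels and feature values are hypothetical.

```python
import math
from collections import defaultdict


def estimate_targets(tokens):
    """Estimate one target per phoneme as the mean of its token vectors.

    `tokens` is an iterable of (phoneme_label, feature_vector) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for label, vector in tokens:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def diphone_distance(targets, context, target):
    """Euclidean distance from the context phoneme's target to the current one."""
    return math.dist(targets[context], targets[target])


# Hypothetical two-dimensional articulatory features for two phonemes.
tokens = [
    ("t", [0.0, 0.0]),
    ("t", [2.0, 0.0]),
    ("a", [1.0, 4.0]),
    ("a", [1.0, 2.0]),
]
targets = estimate_targets(tokens)
distance_ta = diphone_distance(targets, "t", "a")
```

In practice the feature vectors would be high-dimensional and derived from the image data, but the averaging and distance steps would have the same structure.
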

It  has  been  well-established  that  the  temporal  relationship  between  speech  gestures  varies 
as a function of their positions within the syllable Browman & Goldstein (1995); Krakow
(1999);  Byrd  et  al.  (2009).  Therefore,  it  was  hypothesized  that  adherence  to  Fitts’  law 
might  vary  depending  on  the  task  type,  where  type  was  determined  by  syllabic  position. 

To facilitate analysis of speech tasks conditioned on syllable position, a syllabification was
performed, and five categories of interest were defined with respect to syllable structure (see
Figure 4):

•  Category  1:  Onset-Nucleus  Task  (initial  position:  final  onset  consonant;  target:  syllable 
nucleus) 

•  Category  2:  Nucleus-Coda  Task  (initial  position:  syllable  nucleus;  target:  first  coda 
consonant) 

•  Category  3:  Onset-Onset  Task  (initial  position:  onset  consonant;  target:  succeeding  onset 
consonant) 

•  Category  4:  Coda-Coda  Task  (initial  position:  coda  consonant;  target:  succeeding  coda 
consonant) 

•  Category  5:  Coda-Onset  Task  (initial  position:  final  coda  consonant;  target:  first  onset 
consonant  of  succeeding  syllable) 

Note  that  the  tasks  in  category  5  are  across  syllables,  whereas  the  tasks  in  categories  1-4 
are all within a single syllable.
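
The five categories above can be expressed as a simple lookup on the syllabic roles of consecutive phones. The sketch below is one possible encoding, not the syllabification procedure used in this work: the `Phone` structure, the role names and the hand-syllabified example sequence are all hypothetical, and the function assumes that the context and target are adjacent in the phone sequence.

```python
from typing import NamedTuple, Optional


class Phone(NamedTuple):
    label: str
    syllable: int  # index of the syllable containing the phone
    role: str      # "onset", "nucleus" or "coda"


# Categories 1-4: context and target lie within the same syllable.
_WITHIN_SYLLABLE = {
    ("onset", "nucleus"): 1,
    ("nucleus", "coda"): 2,
    ("onset", "onset"): 3,
    ("coda", "coda"): 4,
}


def task_category(context: Phone, target: Phone) -> Optional[int]:
    """Category (1-5) of the task from `context` to `target`.

    Assumes the two phones are consecutive; returns None if no category applies.
    """
    roles = (context.role, target.role)
    if context.syllable == target.syllable:
        return _WITHIN_SYLLABLE.get(roles)
    if target.syllable == context.syllable + 1 and roles == ("coda", "onset"):
        return 5  # Category 5: across a syllable boundary
    return None


# Hypothetical hand-syllabified sequence: "plant" followed by "stem".
phones = [
    Phone("p", 0, "onset"), Phone("l", 0, "onset"), Phone("ae", 0, "nucleus"),
    Phone("n", 0, "coda"), Phone("t", 0, "coda"),
    Phone("s", 1, "onset"), Phone("t", 1, "onset"), Phone("eh", 1, "nucleus"),
    Phone("m", 1, "coda"),
]
categories = [task_category(a, b) for a, b in zip(phones, phones[1:])]
```

Transitions that fall outside the five categories, such as a nucleus followed directly by the onset of the next syllable, are returned as `None` and would simply be excluded from a per-category analysis.
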

Method 

Data, Pre-Processing and Feature Extraction

Data  used  in  the  present  study  are  from  the  USC-TIMIT  database  Narayanan  et  al. 
(2014).  USC-TIMIT  is  a  publicly-available  collection  of  speech  production  data  from 
speakers  of  American  English.  Speech  articulation  data  were  gathered  for  the  database 
using  two  different  modalities,  rtMRI  and  electromagnetic  articulography  (EMA).  The 
rtMRI  data  were  used  in  the  present  analysis.  Resolution  of  the  rtMRI  data  is  68  by  68 
pixels,  with  pixels  2.9  by  2.9  mm  in  size,  at  a  frame  rate  of  23.18  frames/s.  Audio  was 
simultaneously  recorded  at  a  sampling  frequency  of  20  kHz,  and  later  subjected  to  noise 
cancellation Bresch et al. (2006).

The rtMRI data from all five male (M) and five female (F) subjects from the database (i.e.,
M1-5 and F1-5) were used in the present analysis. Forced phoneme alignment was carried
out using SAIL-Align Katsamanis et al. (2011). Subjects were analyzed separately, due to
concerns  about  the  proper  method  of  combining  articulatory  features  across  subjects. 
Subjects  read  aloud  the  460  sentences  constituting  the  MOCHA-TIMIT  corpus  Wrench 
(1999).  For  three  of  the  speakers,  software-related  difficulties  resulted  in  MRI  frames  going 
unrecorded  in  the  data,  which  makes  ideal  audio-video  synchronization  impossible. 
Sentences  in  which  this  problem  arose  were  discarded.  In  the  end,  346  of  the  4600  total 
sentences  were  discarded,  including  175  from  F4  (sentences  286  to  460),  166  from  M5 
(sentences  295-460)  and  only  five  sentences  from  M3  (sentences  331-335).  All  460  sentences 
were  represented  in  the  data  for  the  other  seven  subjects. 

The  analysis  presented  here  began  by  treating  the  gray-scale  intensity  values  of  each  pixel 
in  the  image  plane  as  a  candidate  articulatory  feature  Lammert  et  al.  (2010);  Lammert, 
Ramanarayanan,  et  al.  (2013).  These  candidate  features  were  pre-processed  and 
recombined  prior  to  analysis,  in  order  to  produce  new  features  that  are  fewer  in  number 
and  more  specific  to  speech  articulation  (details  below).  Such  a  pixel-wise  approach  may 
seem  unintuitive,  but  it  provides  the  opportunity  to  analyze  data  about  the  entire 
midsagittal  plane,  while  making  minimal  assumptions  about  what  information  might  be 
important for describing articulation. Pixel-wise analysis is also relatively robust compared
to a more traditional edge-detection and boundary-extraction approach when applied to
low-contrast, low spatial-resolution rtMR images Lammert et al. (2014).

The  rtMRI  image  sequences  were  pre-processed  to  facilitate  further  analysis,  in  particular 
to  (a)  isolate  frames  of  interest,  and  (b)  reduce  the  high  dimensionality  of  the  data  to  a 
manageable number. Analysis began with an image sequence, X, of the form
$X = [I_1\ I_2\ I_3\ \cdots\ I_n]^T$, comprising all n image frames $I_m$ in the corpus from a single subject,
where the images $I_m$ are vectorized in column format. That is, pixels located at (i, j) in
rectangular r by c image format are now located at $c(i - 1) + j$ in the vector $I$, and $I$ is of
length rc. Prior to further analysis, images underwent an intensity correction procedure to
compensate for the reduction in coil sensitivity moving posteriorly, from the lips toward the
pharynx (i.e., at increasing spatial distance from the coil). A retrospective correction
scheme was implemented, incorporating a nonparametric, monotonically increasing
estimate of coil sensitivity, which was derived from all pixel values in the video sequence
Lammert, Ramanarayanan, et al. (2013). Decreasing coil sensitivity results in lower mean
and smaller dynamic range of intensity values for pixels at large distances from the coil.
Intensity correction must be done to ensure that pixel intensity values can be compared
and interpreted across all spatial locations. Image intensity correction results in a matrix
$X^c$ of corrected image vectors.
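The index mapping used in the vectorization step can be checked with a minimal sketch. It assumes row-by-row flattening of a small toy image; the function names `vectorize` and `vector_index` are illustrative and are not taken from the original analysis code.

```python
def vectorize(image):
    """Flatten an r-by-c image (a list of rows) into a single vector, row by row."""
    return [pixel for row in image for pixel in row]


def vector_index(i, j, c):
    """1-based position in the vector of the pixel at 1-based (row i, column j)."""
    return c * (i - 1) + j


# Toy image with r = 2 rows and c = 3 columns (values are arbitrary labels).
image = [[11, 12, 13],
         [21, 22, 23]]
vec = vectorize(image)

# The pixel at row 2, column 1 lands at position c*(2-1)+1 = 4 in the vector.
position = vector_index(2, 1, c=3)
pixel = vec[position - 1]  # subtract 1 because Python lists are 0-based
```

Stacking one such vector per frame as rows yields the n by rc matrix X described above.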

Pixels that are unrelated to vocal tract action were eliminated by a simple threshold
procedure. Pixels representing the air around the head, or representing static spinal or
brain tissue, have intensities that change very little over the image sequence. These pixels
can be identified by calculating the variance along columns of $X^c$: they are the columns
with the lowest variance. Such pixels represent approximately 75% of all pixels in the
images analyzed in the present work, as identified by visual inspection of the images.
Therefore, the matrix $X^c_{sub}$ was formed, which contained only those columns of $X^c$ with
variance above the 75th percentile across all columns.
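The variance threshold can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the percentile is computed with a nearest-rank rule, the data are a toy 3-frame, 4-pixel matrix, and the name `select_high_variance_columns` is hypothetical.

```python
import math
import statistics


def select_high_variance_columns(X, percentile=75.0):
    """Keep the columns of X whose variance exceeds the given percentile.

    X is a list of rows, one vectorized frame per row. The nearest-rank
    percentile rule used here is an assumption of this sketch.
    """
    n_cols = len(X[0])
    variances = [statistics.pvariance([row[j] for row in X]) for j in range(n_cols)]
    rank = max(1, math.ceil(percentile / 100.0 * n_cols))
    cutoff = sorted(variances)[rank - 1]
    keep = [j for j, v in enumerate(variances) if v > cutoff]
    X_sub = [[row[j] for j in keep] for row in X]
    return keep, X_sub


# Toy data: column 0 is static, column 3 varies the most across frames.
frames = [[1, 1, 0, 0],
          [1, 2, 1, 10],
          [1, 3, 0, 20]]
kept_columns, frames_sub = select_high_variance_columns(frames)
```

With four columns and a 75th-percentile cutoff, one column in four survives, which mirrors the rc/4 column count stated in the text.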

The matrix $X^c_{sub}$ is therefore n by rc/4 in size, but only a subset of the n data vectors
represent vocal tract configurations temporally close to an articulatory target. Using the
above operational definition of articulatory targets, the row vectors in $X^c_{sub}$ corresponding
to the temporal centers of phones are identified and extracted. From the forced alignment,
each phone m is assigned a starting boundary $A_m$ and an ending boundary $B_m$, both in
seconds. From these, the temporal center of a phone can be calculated as
$T_m = (A_m + B_m)/2$, and the corresponding image frame is $\arg\min_i (T_m - \tau_i)^2$ for
timestamps $\tau_1, \ldots, \tau_n$ associated with each original image frame. In this way, a new matrix
Y is formed, which is P by rc/4 in size, where P is the total number of phones represented
in the image sequence, $P \approx 15121$.
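The frame-selection rule can be illustrated with a short sketch. It assumes that frame i is stamped at i divided by the frame rate, which is a simplification made here for illustration; the phone boundaries are made-up values, and the function names are hypothetical.

```python
def phone_center(start, end):
    """Temporal center of a phone from its forced-alignment boundaries (seconds)."""
    return (start + end) / 2.0


def nearest_frame(center, timestamps):
    """Index of the frame whose timestamp minimizes the squared time difference."""
    return min(range(len(timestamps)), key=lambda i: (center - timestamps[i]) ** 2)


FRAME_RATE = 23.18  # frames per second, as reported for the rtMRI data
timestamps = [i / FRAME_RATE for i in range(10)]  # assumed uniform timestamps

center = phone_center(0.10, 0.20)          # hypothetical phone from 100 to 200 ms
frame_index = nearest_frame(center, timestamps)
```

At this frame rate, adjacent frames are about 43 ms apart, so the selected frame can lie up to roughly 22 ms from the true phone center.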

Principal Component Analysis (PCA) was employed to further reduce the data
dimensionality. $Z = Y C_L$ was computed, where C is the matrix whose columns are
eigenvectors of $Y^T Y$, and $C_L$ is a matrix containing only the L columns that represent
eigenvectors with the highest eigenvalues (i.e., the largest principal components). The
magnitude of L was chosen so as to retain > 85% of the variance for each subject being
analyzed. Across the subjects analyzed in the present study, L was approximately equal to
50. The resulting P by L matrix Z, which contains a reduced-dimension representation of
each vocal tract configuration nearest to an articulatory target, was used for all subsequent
analyses. Images illustrating the key stages in this image pre-processing pipeline are shown
in Figure 5. It is worth noting that PCA performed with the correlation matrix, rather
than the covariance matrix, might provide an alternative method of dimensionality
reduction that would also obviate the need for image intensity correction.
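The rule for choosing L can be sketched without performing the eigendecomposition itself. The example below assumes the eigenvalues are already available and uses invented values; `choose_num_components` is an illustrative name.

```python
def choose_num_components(eigenvalues, retain=0.85):
    """Smallest number of leading components whose variance share exceeds `retain`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total > retain:
            return count
    return len(ordered)


# Invented eigenvalues: cumulative shares are 0.50, 0.80, 0.90, 0.95, 1.00.
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
L = choose_num_components(example_eigenvalues)
```

Because the threshold is applied per subject, L can differ slightly between subjects, consistent with the report that L was approximately 50.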

Syllabification  was  performed  based  on  the  forced  alignment  results,  beginning  with  the 
word-level  transcription  from  the  adaptive  forced  alignment  procedure.  Words  were 
translated into phoneme sequences using their entries in the CMU Pronouncing
Dictionary,  which  are  already  syllabified.  Syllables  were  subsequently  divided  into  onset, 
nucleus  and  coda  by  identifying  the  vowel  as  the  nucleus,  and  considering  all  phones 
preceding  the  nucleus  as  part  of  the  onset,  and  all  phones  following  the  nucleus  as  part  of 
the  coda.  This  syllabification  allowed  for  (a)  partitioning  tasks  into  the  meaningful 
categories  of  interest  with  respect  to  syllabic  structure,  and  (b)  calculation  of  syllable 
position-specific  movement  times. 
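The partitioning of syllables and the assignment of tasks to categories can be sketched as below. This is a hedged illustration: it assumes syllabified ARPAbet phone lists as input, an 11-vowel set matching the vowel count mentioned later in the text, and hypothetical function names. Transitions that fit none of the five categories are labeled `None`.

```python
# Assumed monophthong vowel set in ARPAbet notation (11 vowels).
VOWELS = {"AA", "AE", "AH", "AO", "EH", "EY", "IH", "IY", "OW", "UH", "UW"}

# Within-syllable transitions and the category names used in the text.
WITHIN = {
    ("onset", "nucleus"): "Onset-Nucleus",
    ("nucleus", "coda"): "Nucleus-Coda",
    ("onset", "onset"): "Onset-Onset",
    ("coda", "coda"): "Coda-Coda",
}


def label_positions(syllables):
    """Tag every phone with its syllable role and syllable index."""
    labelled = []
    for s, phones in enumerate(syllables):
        bare = [p.rstrip("012") for p in phones]  # drop stress digits if present
        nucleus_at = next(i for i, p in enumerate(bare) if p in VOWELS)
        for i, p in enumerate(bare):
            role = "onset" if i < nucleus_at else "nucleus" if i == nucleus_at else "coda"
            labelled.append((p, role, s))
    return labelled


def task_categories(syllables):
    """Return (initial phone, target phone, category) for each adjacent phone pair."""
    labelled = label_positions(syllables)
    tasks = []
    for (p0, r0, s0), (p1, r1, s1) in zip(labelled, labelled[1:]):
        if s0 == s1:
            name = WITHIN.get((r0, r1))
        elif (r0, r1) == ("coda", "onset"):
            name = "Coda-Onset"  # the only category spanning a syllable boundary
        else:
            name = None
        tasks.append((p0, p1, name))
    return tasks


# Example: the syllables of "strict" followed by "ten".
example = [["S", "T", "R", "IH1", "K", "T"], ["T", "EH1", "N"]]
tasks = task_categories(example)
names = [name for _, _, name in tasks]
```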

Distance  &  Width  Calculations 

For the purposes of analysis, a phoneme vector n is defined, which is of length P. The pth
element of n, $n_p$, is a numerical index from 1 to 35, uniquely specifying an American
English phoneme, and representing the phoneme associated with row p of Z. The vector
$S_g$, which is of length P, is associated with a given phoneme index g from 1 to 35. The pth
element of $S_g$ is 1 whenever $n_p = g$, and 0 elsewhere. The mean configuration vector
associated with the phoneme indexed by g is

$$F_g = \frac{\mathbf{1}^T \operatorname{diag}(S_g)\, Z}{\|S_g\|_1}$$

where $\mathbf{1}$ is a vector of ones. The vector $F_g$ represents our operationally-defined articulatory
target associated with the phoneme indexed by g.

For  every  pair  of  phoneme  indices  g  and  h,  it  is  now  possible  to  state  precisely  the  spatial 
distance  between  the  associated  phonemes.  Using  the  Euclidean  distance  in  the 
L-dimensional articulatory space, the distance is $D_{gh} = \|F_g - F_h\|$. A graphical
representation  of  this  can  be  seen  in  Figure  6. 
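A minimal sketch of the target and distance computations follows. It uses a toy two-dimensional articulatory space and two phoneme indices; the names `mean_configurations` and `target_distance` are illustrative only.

```python
import math


def mean_configurations(Z, labels):
    """Average the rows of Z that share a phoneme index, giving one target per phoneme."""
    sums, counts = {}, {}
    for row, g in zip(Z, labels):
        if g not in sums:
            sums[g] = [0.0] * len(row)
            counts[g] = 0
        sums[g] = [a + b for a, b in zip(sums[g], row)]
        counts[g] += 1
    return {g: [v / counts[g] for v in sums[g]] for g in sums}


def target_distance(targets, g, h):
    """Euclidean distance between the targets of phonemes g and h."""
    return math.dist(targets[g], targets[h])


# Toy reduced-dimension data: four phone tokens, two phoneme indices.
Z = [[0.0, 0.0], [2.0, 0.0], [4.0, 3.0], [4.0, 5.0]]
labels = [1, 1, 2, 2]
targets = mean_configurations(Z, labels)
d_12 = target_distance(targets, 1, 2)
```

Averaging tokens per phoneme is equivalent to the matrix expression for the mean configuration vector, since the indicator vector selects the rows and its 1-norm counts them.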

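To calculate the time to reach phoneme h from g (indices), assume that $S_{gh}$ is a vector that
is 1 whenever both $n_p = h$ and $n_{p-1} = g$. Similarly, $S_{hg}$ is 1 whenever both $n_p = g$ and
$n_{p+1} = h$. The mean time, then, between the phonemes indexed by g and h across all
instances is

$$T_{gh} = \frac{\mathbf{1}^T \operatorname{diag}(S_{gh})\, T - \mathbf{1}^T \operatorname{diag}(S_{hg})\, T}{\|S_{gh}\|_1}$$

where T is the vector of temporal centers of the P phones.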

As  mentioned  in  the  discussion  of  syllabification  above,  it  was  hypothesized  that  adherence 
to  Fitts’  law  might  vary  depending  on  syllable  position,  because  the  temporal  relationships 
between speech gestures are known to vary as a function of syllable position. Therefore,
$T_{gh}$ was also calculated separately for each of the five syllable position-based categories listed
above. When the linear relationship between ID and T was assessed on a
category-by-category basis, it was this category-specific value of $T_{gh}$ that was used.
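The mean movement time between two phonemes can be sketched as the average gap between the temporal centers of adjacent phones. The sketch below assumes a flat sequence of phoneme indices with one center time per phone; the values are invented and `mean_transition_time` is a hypothetical name. A category-specific version would simply restrict the loop to transitions belonging to one category.

```python
def mean_transition_time(labels, centers, g, h):
    """Mean time from phoneme g to an immediately following phoneme h.

    labels:  phoneme index of each phone, in temporal order
    centers: temporal center of each phone, in seconds
    Returns None when the sequence g -> h never occurs.
    """
    gaps = [
        centers[p] - centers[p - 1]
        for p in range(1, len(labels))
        if labels[p - 1] == g and labels[p] == h
    ]
    return sum(gaps) / len(gaps) if gaps else None


# Invented sequence in which the transition 1 -> 2 occurs twice.
labels = [1, 2, 3, 1, 2]
centers = [0.00, 0.10, 0.25, 0.40, 0.46]
t_12 = mean_transition_time(labels, centers, g=1, h=2)
```

Summing the center times of the h phones, subtracting the summed center times of the preceding g phones, and dividing by the number of instances gives the same number as averaging the individual gaps.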

There are many possible definitions for the width of the target in a speech production task.
Unlike in Fitts’ original experiments, there are no hard physical limits around the target, which
necessitates exploring other definitions. Width could be defined in terms of variability
about the target, as in later measures of “effective” width Welford (1968); Fitts & Peterson
(1964). Other definitions have been based on the amount of under/overshoot associated
with a particular movement Bullock & Grossberg (1988). However, because phonetic
contrasts in speech can be made with very small changes in vocal tract
configuration, another definition is possible, based on the density of targets
in articulatory space. Consider the distance values $D_{fh}$ for a given h and all $f = 1, \ldots, 35$.
These distance values with respect to h can be sorted and ranked, and, given a parameter
k, we can select the distance between $F_h$ and the kth closest vector $F_{f_k}$. That distance can
be used as the basis for a high-dimensional k-nearest-neighbor density calculation. The
probability density of configuration vectors in the neighborhood of $F_h$ will be:

$$\rho_h = \frac{k}{35 \, \dfrac{\pi^{L/2}}{\Gamma(L/2 + 1)} \, D_{f_k h}^{\,L}}$$

where $\Gamma(\cdot)$ is the gamma function and 35 is the number of phonemes under consideration
(24 consonants and 11 vowels, with no diphthongs or rhoticized vowels). The width can be
calculated from this probability density as $W_g = -\log_2(\rho_g)$. Note that the final width
value does not depend on the context.
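The density-based width can be sketched as follows, using the standard k-nearest-neighbor density estimate with the volume of an L-dimensional ball. Two choices here are assumptions of the sketch rather than details confirmed by the text: the zero distance from a target to itself is excluded from the ranking, and the toy example uses L = 2 with four targets so the numbers are easy to verify by hand.

```python
import math


def knn_density(distances, k, L, n_targets=35):
    """k-nearest-neighbor density estimate around one target.

    distances: distances from the target to every other target
               (the zero self-distance is assumed to be excluded)
    k:         which neighbor sets the ball radius
    L:         dimensionality of the articulatory space
    """
    radius = sorted(distances)[k - 1]
    unit_ball_volume = math.pi ** (L / 2.0) / math.gamma(L / 2.0 + 1.0)
    return k / (n_targets * unit_ball_volume * radius ** L)


def width_from_density(density):
    """Width as the negative base-2 logarithm of the local density."""
    return -math.log2(density)


# Toy case: nearest neighbor at distance 1 in a plane, four targets in total.
density = knn_density([3.0, 1.0, 2.0], k=1, L=2, n_targets=4)
width = width_from_density(density)
```

Under this definition, a target in a crowded region of articulatory space has high density and therefore a small width, so it demands more accuracy.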

Fitts’ law can be evaluated directly using $D_{gh}$, $T_{gh}$ and $W_h$ for any phoneme indexed by h,
and presented in the context of another phoneme g. Applying Equation 1, it is possible to
calculate $ID_{gh} = \log_2(2 D_{gh} / W_h)$. Furthermore, by Equation 2, we expect that
$T_{gh} = a \cdot ID_{gh} + b$, for some coefficients a and b. Images of initial positions and targets for
one example each of high- and low-ID tasks are shown in Figure 7.
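The two equations can be put together in a brief sketch: the index of difficulty is computed from a distance and a width, and the coefficients a and b are estimated by ordinary least squares. The data points are invented for illustration, and the use of ordinary least squares is an assumption of this sketch.

```python
import math


def index_of_difficulty(distance, width):
    """Fitts' index of difficulty in bits: log2(2D / W)."""
    return math.log2(2.0 * distance / width)


def fit_line(x, y):
    """Ordinary least-squares slope a and intercept b for y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        / sum((xi - mean_x) ** 2 for xi in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


id_example = index_of_difficulty(distance=4.0, width=2.0)  # log2(4) = 2 bits

# Invented movement times (seconds) lying exactly on T = 0.02 * ID + 0.05.
ids = [1.0, 2.0, 3.0]
times = [0.07, 0.09, 0.11]
a, b = fit_line(ids, times)
```

Here a has units of seconds per bit and b is the movement time extrapolated to zero difficulty.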


Results  and  Discussion 

The  strength  of  the  relationship  between  MT  and  ID  was  assessed  using  linear  correlation 
(Pearson’s  r),  in  keeping  with  the  linear  form  of  Fitts’  law.  The  correlation  coefficients  are 
shown  in  Table  1,  divided  by  syllable  position-specific  category  and  by  subject.  Performing 
this  analysis  separately  for  each  subject,  and  once  for  each  task  category,  means  that  a 
total of 5 × 10 = 50 individual correlations were calculated, with 50 corresponding tests for
statistical significance. It therefore became necessary to consider the significance of these
results in light of a multiple comparisons adjustment. It is not clear whether or
how much these individual correlations are dependent, so statistical significance of the
correlation coefficients is shown at three distinct threshold values: α = 0.05 (Fisher’s
traditional value), α = 0.01 (an intermediate value) and α = 0.001, which is the
conservative, Bonferroni-adjusted threshold value (0.05/50).
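The correlation measure and the three significance thresholds can be sketched as below. Computing the p-value itself is left out, since it requires a t-distribution that is not in the Python standard library; the function names and the star-marking helper are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


def significance_stars(p_value, thresholds=(0.05, 0.01, 0.001)):
    """One star for each threshold that the p-value falls below."""
    return "*" * sum(p_value < t for t in thresholds)


r_perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
adjusted_alpha = bonferroni_threshold(0.05, 5 * 10)  # 5 categories x 10 subjects
stars = significance_stars(0.004)
```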

Subject M2 had the generally highest correlation values of all subjects, indicating that the
Fitts-style relationships were strongest and clearest for that subject. Figure 8 shows MT
versus ID for subject M2, showing the strength and nature of those relationships for each of
the syllable position-specific task categories. The correlation values corresponding to each
category, and the associated p-values, are shown above each plot.

Articulatory  Difficulty 

Results  suggest  that  the  difficulty  associated  with  targeted  articulatory  kinematics  is 
highly  variable  in  speech  production.  ID  ranges  from  approximately  0.25  to  1.75  bits  for  all 
subjects.  A  few  general  patterns  in  the  distribution  of  ID  can  be  noted.  Difficulty  was 
assessed  by  looking  at  the  overall  average  ID  associated  with  a  given  target  position.  ID 
values  for  each  subject  were  normalized  between  0  and  1,  in  advance  of  taking  the  mean  ID 
for  each  task  across  subjects.  The  mean  ID  was  then  calculated  for  each  task  with  a 
consonant  target,  given  a  vowel  initial  position.  These  tasks,  listed  from  most  difficult  to 
least difficult, were: 3, tʃ, θ, h, ʃ, p, w, dʒ, 3, b, g, k, j, f, ŋ, v, z, s, m, l, d, r, n, t. The mean
ID was also calculated for each task with a vowel target and a consonant initial position.
These tasks, listed from most difficult to least difficult, were: u, o, a, o, e, æ, u, e, i, ɪ, ə. Figure 7
shows example low- and high-ID tasks for subject M2.
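The normalization and ranking procedure can be sketched as follows. The subject labels, task labels and ID values are invented for illustration and do not reproduce any result reported here; `normalize` and `rank_tasks` are hypothetical names.

```python
def normalize(ids_by_task):
    """Rescale one subject's ID values linearly so they span 0 to 1."""
    low, high = min(ids_by_task.values()), max(ids_by_task.values())
    return {task: (value - low) / (high - low) for task, value in ids_by_task.items()}


def rank_tasks(ids_by_subject):
    """Average normalized ID across subjects and sort tasks from most to least difficult."""
    normalized = [normalize(ids) for ids in ids_by_subject.values()]
    tasks = normalized[0].keys()
    mean_id = {t: sum(n[t] for n in normalized) / len(normalized) for t in tasks}
    ranking = sorted(mean_id, key=mean_id.get, reverse=True)
    return ranking, mean_id


# Invented ID values (bits) for three target labels and two subjects.
ids_by_subject = {
    "subject_A": {"t": 0.25, "p": 1.00, "s": 1.75},
    "subject_B": {"t": 0.50, "p": 1.50, "s": 0.75},
}
ranking, mean_id = rank_tasks(ids_by_subject)
```

Normalizing within each subject before averaging prevents a subject with a wider overall ID range from dominating the across-subject mean.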

Consonant tasks involving labial articulation, whether primary (/p/, /w/, /b/), or
secondary (/ʃ/), tend to have a higher difficulty. Nasals and liquids all ranked as lower
difficulty. Fricatives /ʃ/, /θ/, /3/ and affricates show higher ID, a fact that is consistent
with Hardcastle’s assertion that fricatives, and perhaps by extension affricates, are the
sounds of speech requiring the greatest accuracy. This does not necessarily include all
fricatives, however, as /z/ and /s/ were ranked relatively lower. The difficulty associated
with producing fricatives and affricates is particularly evident when examining them in the
context  of  low,  back  vowels,  where  the  distance  from  the  initial  position  to  the  target 
position  is  lengthened.  One  can  also  observe  that  these  articulatory  tasks  require  more 
time  to  complete  than  other  tasks.  Also  consistent  with  Hardcastle  is  the  observation  that 
stop  consonants  -  particularly  alveolar  -  require  little  accuracy,  and  are  therefore  not 
difficult.  Since  distance  is  a  factor  under  consideration,  one  can  see  that  this  effect  is  again 
emphasized  when  the  initial  position  is  a  high,  front  lax  vowel.  It  is  important  to 
remember  that  sibilants  may  have  complex  aerodynamic  requirements,  and  off-midsagittal 
kinematic  requirements,  that  will  not  necessarily  show  up  in  the  purely  midsagittal 
kinematic  analysis  in  the  present  work. 

Vowel targets that were low and back had a higher level of difficulty, as compared to the
relatively lower difficulty high and front vowels. Vowels that were not directly along this
primary low-back/high-front axis, including the high-back vowel /u/ and the low-front
vowel /æ/, appeared together toward the middle of the vowel difficulty ranking. Schwa was
ranked as the least difficult vowel to produce. This ranking is consistent with the
importance of D in computing ID. With a schwa target, the speech articulators should
have, on average, a shorter distance to travel from other initial positions. It has been
shown that, although it has a distinct phonetic identity, schwa is perceptually and
articulatorily similar to “articulatory setting”, which is the neutral posture from which
speech  actions  are  deployed  and  to  which  they  tend  to  return  Ramanarayanan  et  al.  (2013). 
This  neutral  posture  is  hypothesized  to  be  kinematically  advantageous,  just  as  there  is 
evidence  that  it  is  mechanically  advantageous  Ramanarayanan  et  al.  (2014).  Conversely, 
low-back  vowels  should  require  the  speech  articulators  to  travel  longer  distances  from  a 
variety  of  initial  positions,  in  order  to  reach  kinematic  targets  in  the  region  of  the  pharynx. 

One other issue of note concerns the fact that the distance and width parameters do not
seem to contribute equally to the index of difficulty, given the present definitions and
current data. Although both parameters influence the final value of ID, difficulty seems to
be determined to a much larger degree by distance than by width. For instance, for subject
M2, the correlation between D and ID across all diphones is substantially greater
(Spearman’s ρ = 0.987, n = 1190, p ≈ 0) than the correlation between W and ID
(Spearman’s ρ = 0.222, n = 1190, p ≈ 0). Similar trends are seen across all subjects. Note
that this correlation between W and ID is in the opposite direction from expected, based
on the equation for ID. This may be due to the fact that, for these data, D and W seem to
be positively correlated (e.g., for subject M2: Spearman’s ρ = 0.3565, n = 1190, p ≈ 0).

Speed-Accuracy Tradeoff

Results  suggest  that  targeted  speech  actions  exhibit  a  clear  tradeoff  between  speed  and 
accuracy  in  certain  task  categories,  and  with  substantial  interspeaker  variability. 

Significant  correlations  can  be  seen  in  the  data  that  correspond  to  the  relationship  between 
MT  and  ID  predicted  by  Fitts’  law.  The  strength  of  that  relationship  varies  across  speaker 
and  task  type.  The  strongest  and  most  highly  significant  of  such  relationships  are  seen  for 
Nucleus-Coda tasks across all subjects. Onset-Nucleus and Coda-Onset tasks also showed
generally high correlations that were significant for at least three subjects (M2, F1 and F2,
but also M4 for Coda-Onset tasks). Note that far fewer Onset-Onset and Coda-Coda
tasks exist, as compared to other task types. For speakers that display significant
correlations between ID and MT, the relationships appear linear (see, e.g., Figure 8), but
with abundant added noise around the trend line. There is also a notable deviation from
linearity at small ID values, where MT appears to hit a minimum value around 50 ms. This
floor effect may reflect physiological constraints on the production apparatus.

Despite  several  significant  correlation  values  between  MT  and  ID,  the  correlations  observed 
in  the  present  analysis  are  relatively  modest  compared  to  those  observed  in  other  domains 
of  human  movement.  Correlation  coefficients  above  0.9  are  commonly  reported  in  the 
literature  MacKenzie  (1992),  whereas  the  best  correlation  value  observed  in  the  present 
analysis  was  0.72  (Subject  M2,  Nucleus-Coda  task).  The  correlation  values  are  also  highly 
dependent on the task type and the subject under consideration. One question raised by
such results is why this seemingly fundamental tradeoff, which has been well established in
other motor domains, appears to be only weakly and inconsistently obeyed in the domain
of speech production. There are several potential explanations, which are considered below.

The class of movements considered ballistic (i.e., occurring without feedback control while
movement is underway) provides one explanation to consider. It has been argued
that ballistic movements, including eye saccades, do not obey Fitts’ law because a lack of
feedback means that movement time does not depend on the required accuracy of the task,
but only on the movement amplitude Carpenter (1988); Drewes (2013). It is possible that
certain speech production actions implement ballistic control. However, despite their
rapidity, speech motor tasks are typically not modeled as being ballistic in nature. Major
models of speech motor control involve feedback at fine temporal scales. It has already
been discussed that the Task Dynamics model relies on feedback, and leads naturally to
precisely the kind of speed-accuracy tradeoffs described by Fitts’ law Saltzman & Munhall
(1989). Other prominent models of speech motor control, such as the DIVA model
Guenther et al. (1998) and State Feedback Control Houde & Nagarajan (2011), also rely on
feedback,  although  the  connection  to  Fitts’  law  has  not  been  explicitly  made.  Therefore, 
rather  than  concluding  that  each  of  these  models  is  inaccurate,  and  that  ballistic  control  of 
speech  movements  provides  a  better  explanation  of  the  present  data,  a  more  likely 
explanation  for  the  weak  and  variable  observed  relationships  is  that  the  definition  of  speech 
tasks  used  in  the  present  work  needs  to  be  revised  in  one  of  several  ways. 

One  way  to  reconsider  the  presently-used  definition  of  speech  tasks  is  to  make  them 
multimodal,  for  instance  by  incorporating  prosodic  constraints.  As  mentioned  above, 
speech  has  multiple  levels  in  which  accuracy  may  be  demanded.  Speech  motor  actions  have 
communicative  and  prosodic  goals,  in  addition  to  kinematic  requirements.  Temporal 
constraints  exist  as  part  of  those  goals,  both  at  the  level  of  phonetic  segments  (e.g., 
lengthening as a phonemic contrast) and suprasegmentally (e.g., accenting). Indeed, the
objective function for speech motor control might be formulated as an
information-theoretic measure exemplifying both the achievement of the kinematic goal
and any temporal information encoding, including active and incidental temporal aspects.
A modification of speech tasks (and, perhaps, Fitts’ law itself) is needed to account for
these various levels of task requirements, and associated timing requirements.

Another  important  change  to  the  presently-used  definition  of  speech  tasks  may  be  to 
account  for  non-sequential,  overlapping  articulatory  targets,  as  opposed  to  the  purely 
sequential  tasks  considered  in  this  work.  In  fact,  the  results  already  indicate  the  need  for 
such an enhancement. It has been well established that speech articulatory gestures at
certain positions in the syllable are highly overlapping, whereas others are more sequential
Browman & Goldstein (1995); Krakow (1999); Byrd et al. (2009). Specifically, onset
consonants  would  be  expected  to  overlap  with  each  other  extensively,  and  would  overlap 
with  the  succeeding  nucleus,  as  well.  The  nucleus,  in  contrast,  should  overlap  very  little 
with  the  succeeding  consonant  in  the  coda.  The  present  results  clearly  show  that  the 
correlations  are  strongest  for  the  Nucleus-Coda  tasks  across  all  subjects,  which  is  exactly 
what  would  be  predicted  by  the  sequential,  non-overlapping  nature  of  the  gestures  involved 
at  that  position  in  the  syllable.  The  Onset-Onset  tasks,  for  which  the  assumption  of 
sequential  targets  may  be  inappropriate,  show  the  poorest  overall  correlations.  Moreover, 
speech  tasks  should  potentially  also  allow  for  contextually  modified  targets.  Enhancing 
speech  tasks  in  this  way  would  also  allow  for  a  natural  way  to  capture  co-articulation  in 
targets,  as  opposed  to  the  fixed  targets  considered  in  this  work. 

There  are  substantial  interspeaker  differences  in  the  strength  of  correlations  between  ID 
and  MT.  These  differences  are  evident  in  the  Nucleus-Coda  tasks,  where  most  subjects 
displayed  significant  correlations,  but  to  different  degrees.  Interspeaker  differences  are  also 
evident  for  other  tasks,  such  as  the  Onset-Nucleus  tasks,  where  some  subjects  showed 
marginal  correlations  (e.g.,  M3  and  F5)  and  others  (e.g.,  M2  and  F2)  showed  highly 
significant  correlations.  Important  questions  remain  regarding  an  explanation  for  this 
prevalent interspeaker variability. These differences may reflect interspeaker differences in
control strategies, which in turn are a function of speaking rate, age, social community,
morphological  (i.e.,  physical)  variation,  and  a  variety  of  other  factors.  Morphological 
variation,  being  by  definition  a  fundamental  influence  on  kinematics,  holds  potential  as  an 
explanation  for  interspeaker  differences  even  in  a  seemingly  fundamental  law  of  motor 
control  and  behavior  like  Fitts’  law.  It  is  known  that  speakers  vary  widely  in  terms  of  a 
number of morphological characteristics, including vocal tract length Vorperian et al.
(2005, 2009) and relative proportions Fitch & Giedd (1999); Vorperian & Kent (2007), as
well  as  hard  palate  and  posterior  pharyngeal  wall  shape  Lammert,  Proctor,  &  Narayanan 
(2013b),  and  many  other  parameters.  There  is  growing  evidence  that  differences  in 
morphology of the speech apparatus influence the production of specific speech sounds
at  the  level  of  articulatory  goals  and  kinematics  Dart  (1991);  Brunner  et  al.  (2009);  Fuchs 
et  al.  (2008);  Lammert,  Proctor,  &  Narayanan  (2013a).  A  particularly  intuitive  example 
comes  from  indications  that  individuals  vary  in  terms  of  their  tongue  size  relative  to  the 
size  of  the  entire  speech  apparatus  Lammert,  Hagedorn,  et  al.  (2013).  It  seems  reasonable 
to expect that a smaller relative tongue size will result in longer articulatory distances
travelled within the oral and pharyngeal cavities, on average, resulting in a wider range of
values for ID. This wider range of ID might, in turn, cause the relationship between ID and
MT to stand out against any noise in the data. The potential effects of morphology also
point to specific hypotheses. For instance, it has already been discussed in the present
work  how  low-back  vowels  appear  to  be  the  most  difficult  vowels  to  produce,  and  it  was 
suggested  that  this  may  be  the  result  of  longer  articulatory  distances  associated  with 
producing  them.  It  has  been  well-documented  that  males  have  a  proportionally  longer 
pharynx  than  females  Vorperian  et  al.  (2011),  which  would  amplify  the  distances  required 
for  the  tongue  to  travel,  causing  further  increases  in  ID  associated  with  low-back  vowels, 
and  likely  a  wider  range  of  ID  overall.  The  potential  connection  between  vocal  tract 
morphology  and  Fitts’  law  for  speech  production  merits  further  attention. 

It should be noted that there are many sources of variability in the present analysis that
may  have  had  an  impact  on  the  correlation  values,  and  may  limit  the  generality  of  these 
results.  One  limitation  relates  to  the  accuracy  of  finding  a  video  frame  near  the  temporal 
center  of  a  given  phone,  which  is  limited  by  the  temporal  resolution  of  rtMRI  and  the 
quality  of  forced  phoneme  alignment.  Recent  advances  in  rtMRI  protocols  may  alleviate 
this  limitation  Lingala,  Sutton,  et  al.  (2016);  Lingala,  Zhu,  et  al.  (2016).  If  speaking  rate  is 
a  factor,  then  the  correlation  values  from  Table  1  should,  in  turn,  be  correlated  with 
speaking  rate.  Speaking  rate  was  computed  for  all  subjects  by  looking  at  the  mean  time 
between  adjacent  syllable  nuclei  to  get  an  estimate  of  syllable  rate.  Pearson’s  correlations 
were  found  between  these  values  and  the  correlation  values  for  Nucleus-Coda  Consonant  (r 
=  -0.64,  p  =  0.047)  and  Onset  Consonant-Nucleus  (r  =  -0.63,  p  =  0.049).  The  fact  that 
speaking  rate  is  a  factor  indicates  that  the  current  data  may  have  frame  rates  that  are  at 
the  boundary  of  usefulness  for  the  analysis  done  in  this  study.  Higher  frame  rates  would  be 
preferable  in  future  work.  Additional  variability  may  stem  from  non-Gaussian  noise  on 
pixel  intensity  values  that  rtMRI  images  often  contain.  Added  variability  in  the  data  and 
analysis would have the clearest impact on the Onset-Onset and Coda-Coda task results,
due  to  their  much  smaller  number.  Data  are  also  limited  to  a  midsagittal  view  of  the 
speech  articulators,  meaning  not  all  kinematic  aspects  are  captured  in  the  data. 
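The speaking-rate estimate described in this paragraph can be sketched as the reciprocal of the mean interval between adjacent syllable nuclei. The nucleus times below are invented, and the sketch ignores pauses and sentence boundaries, which is a simplifying assumption.

```python
def syllable_rate(nucleus_times):
    """Estimate syllable rate (syllables per second) from nucleus center times.

    nucleus_times: temporal centers of successive syllable nuclei, in seconds.
    """
    gaps = [later - earlier for earlier, later in zip(nucleus_times, nucleus_times[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return 1.0 / mean_gap


# Invented nucleus times spaced 200 ms apart, giving 5 syllables per second.
rate = syllable_rate([0.0, 0.2, 0.4, 0.6])
```

At the reported frame rate of 23.18 frames/s, a speaker producing 5 syllables per second is imaged fewer than five times per syllable, which illustrates why faster speakers are more affected by the temporal resolution limit.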


Conclusion 

This paper has presented an analysis of speech articulation from a large database of
real-time magnetic resonance imaging (rtMRI) data, in order to assess whether articulatory
kinematics conform to Fitts’ law. It appears that certain aspects of speech production do
conform to Fitts’ law, with the strength of that relationship varying across speaker and
context-target type. The strongest such relationships are seen for VC context-target tasks,
with CV tasks showing nearly as strong correlations. Also presented was a novel
methodology for addressing the challenges inherent in performing Fitts-style analysis on
rtMRI data of speech production, from defining the key quantities to extracting them from
rtMRI data. Finally, a novel mathematical argument was presented for the expectation of
Fitts’ law in speech production, and why one expects to observe behavior consistent with
the law on the basis of Task Dynamics and the VITE neural model of directed movement.
Future work should focus on addressing the remaining methodological challenges. Among
these challenges are acquiring higher frame-rate data and exploring additional definitions
of the key relevant quantities.
Pearson’s r (and p-values) between movement time (MT) and index of difficulty (ID) for
all subjects, divided by syllable position-specific category. Correlation coefficients significant
at the α = 0.05, α = 0.01 and α = 0.001 levels are marked with *, ** and ***, respectively.
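
As a rough illustration of how one cell of such a table could be produced, the following Python sketch computes Pearson's r for paired ID and MT values and attaches the star markers described in the caption. The function names are hypothetical, and the p-value is taken as an input rather than computed, so this is not the analysis code used in the present work.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def significance_stars(p):
    """Map a p-value to the *, **, *** markers used in the table caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

def format_cell(r, p):
    """Render one table cell: correlation, stars, and the p-value in parentheses."""
    return f"{r:.2f}{significance_stars(p)} (p = {p:.3g})"
```

A p-value would in practice come from a statistics package; it is left as an input here to keep the sketch free of external dependencies.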


Figure 1. Schematic representation of the Task Dynamics framework. The variable X is
the displacement of the controlled variable in task space and X0 is the target. The Forward
Dynamics component implements a second-order dynamical system, conforming to
Equation 3, that transforms (via inverse kinematics) the error signal, ΔX, into the second
derivative of the articulator-space variable u. The first and second integrals of that
derivative (the velocity of u and u itself) function as motor
commands to the Plant, or speech production apparatus.
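
The behavior of a second-order system of this kind can be sketched numerically. The Python example below assumes a critically damped point attractor with unit mass, x'' = -b x' - k (x - x0) with b = 2*sqrt(k), which is the familiar mass-spring form and may differ in detail from Equation 3. It works in task space only, omitting the inverse kinematics step, and the function name and parameter values are hypothetical.

```python
import math

def simulate_point_attractor(x_init, x_target, k=100.0, dt=0.001, duration=1.0):
    """Integrate x'' = -b x' - k (x - x_target), critically damped, unit mass.

    Assumed form for illustration only; parameter values are arbitrary.
    """
    b = 2.0 * math.sqrt(k)        # critical damping for unit mass
    x, v = x_init, 0.0            # start at rest at the initial position
    trajectory = [x]
    for _ in range(int(round(duration / dt))):
        a = -b * v - k * (x - x_target)   # acceleration driven by the error signal
        v += a * dt                       # first integral: velocity
        x += v * dt                       # second integral: position
        trajectory.append(x)
    return trajectory

trajectory = simulate_point_attractor(0.0, 1.0)
```

Under these assumptions the trajectory approaches the target without overshooting it, which is the qualitative behavior expected of a critically damped system.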
Figure 2. Schematic representation of the VITE neural model (Bullock & Grossberg, 1988).
Note  the  many  similarities  of  this  structure  to  that  of  Task  Dynamics  in  Figure  1.  TPC  is 
a  representation  of  the  target  position,  which  produces  a  target  position  X0.  The  DV 
population  compares  the  target  to  the  system’s  current  position,  and  computes  the 
task-space  dynamics  of  the  network.  The  PPC  population  integrates  the  DV  population 
activation  into  position  information.  The  network  dynamics  have  the  form  described  in 
Equations  6  and  7. 
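
The interaction of the DV and PPC populations can be illustrated with a small simulation. The Python sketch below assumes the commonly cited form of the VITE dynamics, dV/dt = alpha * (-V + T - P) and dP/dt = G * max(V, 0), with a constant GO signal G; the notation in Equations 6 and 7 may differ. The function name and parameter values are hypothetical.

```python
def simulate_vite(p_init, target, alpha=30.0, go=5.0, dt=0.001, duration=1.0):
    """Forward-Euler simulation of an assumed VITE circuit with constant GO signal.

    v: difference-vector (DV) activity; p: present-position (PPC) activity.
    """
    v, p = 0.0, p_init
    positions = [p]
    for _ in range(int(round(duration / dt))):
        dv = alpha * (-v + target - p)   # DV tracks the target-minus-position difference
        dp = go * max(v, 0.0)            # PPC integrates the rectified, gated DV activity
        v += dv * dt
        p += dp * dt
        positions.append(p)
    return positions

positions = simulate_vite(0.0, 1.0)
```

With these illustrative parameter values the position rises monotonically toward the target, since the rectification prevents the PPC activity from decreasing.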
[Figure 3 labels: Target 1 to Target 4; Initial Position 1 to Initial Position 4; movements labeled C-V, V-C, C-V, V-C]


Figure  3.  A  key  concept  behind  the  methodology  developed  in  the  present  work  is  that 
motor  tasks  in  speech  articulation  can  be  viewed  as  a  sequence  of  movements  toward  and 
away  from  target  points  in  articulatory  space.  Those  targets  are  assumed  to  be  approached 
and  approximated,  but  not  necessarily  reached,  at  the  temporal  center  of  each  phone 
interval.  The  initial  position  for  a  given  task  is  assumed  to  be  the  target  immediately 
preceding  the  current  one. 
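
The construction described in this caption can be sketched as follows. The Python example takes a list of phone intervals, places each target at the temporal center of its interval, and pairs each target with the one immediately preceding it. The function name, the dictionary keys, and the sample phone labels are hypothetical, and the sketch does not attempt to reproduce how movement time is measured in the text.

```python
def build_tasks(phone_intervals):
    """Turn (label, start_s, end_s) phone intervals into context-target tasks.

    Each target is located at the temporal center of its phone interval, and
    the initial position of a task is the target of the preceding phone.
    """
    centers = [(label, 0.5 * (start + end)) for label, start, end in phone_intervals]
    tasks = []
    for (prev_label, prev_time), (cur_label, cur_time) in zip(centers, centers[1:]):
        tasks.append({
            "initial": prev_label,        # previous target acts as starting point
            "target": cur_label,
            "initial_time": prev_time,
            "target_time": cur_time,
        })
    return tasks

# Hypothetical three-phone utterance with made-up timings (seconds).
tasks = build_tasks([("k", 0.0, 0.1), ("ae", 0.1, 0.3), ("t", 0.3, 0.4)])
```

A sequence of n phones yields n - 1 tasks, because the first phone has no preceding target to serve as its initial position.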
Figure 4. Illustration of the different syllable position-specific task categories used in the
present  analysis,  shown  on  a  traditional,  generic  syllable  structure  tree.  Categories  are 
numbered  outward  from  the  nucleus,  and  include  tasks  leading  into  and  out  of  the  nucleus 
(1  &  2),  tasks  between  consonants  in  the  onset  and  coda  (3  &  4)  and  tasks  leading  from 
one  syllable  to  the  next  (5). 

Figure 5. Images illustrating stages in the data pre-processing pipeline for a single vocal
tract posture. Shown are (a) an image of a single posture in its original form, (b) the same
image with low-variance pixels masked out, and (c) the same image reconstructed
using only the L PCA-generated features.
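
Stage (b) of this pipeline, masking out low-variance pixels, can be sketched in a few lines. The Python example below computes the variance of each pixel across frames and keeps only pixels whose variance reaches a threshold. The function names, the threshold value, and the toy frames are hypothetical, and the PCA step of stage (c) is not shown.

```python
from statistics import pvariance

def low_variance_mask(frames, threshold):
    """Return one boolean per pixel: True where variance across frames >= threshold.

    frames: list of flattened images, all of the same length.
    """
    n_pixels = len(frames[0])
    return [
        pvariance([frame[i] for frame in frames]) >= threshold
        for i in range(n_pixels)
    ]

def apply_mask(frame, mask, fill=0.0):
    """Replace masked-out (low-variance) pixels with a fill value."""
    return [value if keep else fill for value, keep in zip(frame, mask)]

# Toy data: three frames of three pixels each; only the middle pixel varies much.
frames = [[0.0, 1.0, 5.0], [0.0, 3.0, 5.1], [0.0, 5.0, 4.9]]
mask = low_variance_mask(frames, threshold=0.5)
masked = apply_mask(frames[1], mask)
```

Pixels that barely change across the recording, such as static tissue or background, carry little articulatory information, which is the motivation for discarding them before dimensionality reduction.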
Figure 6. Illustration of the key relationships in calculating ID from articulatory data,
with most variable names taken from the text. Target vectors are defined in the
high-dimensional articulatory space, represented in the illustration by features x1, x2, x3.
In  the  analysis,  this  articulatory  space  is  actually  composed  of  L  total  features.  The 
articulatory  target  vector  Fg  is  the  target  of  the  previous  movement,  and  represents  the 
starting  point  of  the  current  movement.  The  target  of  the  current  movement  is  Fh.  The 
distance  to  the  target  is  the  Euclidean  distance  between  these  two  vectors.  The  width 
around  the  target  is  calculated  with  respect  to  a  hypersphere  around  the  current  target, 
which  is  used  to  estimate  the  density  of  other  target  vectors  that  are  not  the  current  one. 
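
The geometric quantities in this caption can be sketched in Python. The example below computes the Euclidean distance between two target vectors, counts how many other target vectors fall inside a hypersphere around the current target, and evaluates Fitts' original index of difficulty, ID = log2(2D/W). The function names and the radius are hypothetical, and the step that converts the neighbor count into a width W is not reproduced here, so W is taken as an input.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_targets_in_hypersphere(center, other_targets, radius):
    """Count other target vectors lying within `radius` of the current target."""
    return sum(
        1 for target in other_targets
        if euclidean_distance(center, target) <= radius
    )

def index_of_difficulty(distance, width):
    """Fitts' (1954) index of difficulty, log2(2D / W), in bits."""
    if distance <= 0 or width <= 0:
        raise ValueError("distance and width must be positive")
    return math.log2(2.0 * distance / width)

# Toy three-feature example; the real space has L features.
start = [0.0, 0.0, 0.0]      # target of the previous movement
current = [3.0, 4.0, 0.0]    # target of the current movement
distance = euclidean_distance(start, current)
```

A larger distance or a narrower width raises ID, which under Fitts' law corresponds to a longer expected movement time.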




A.  initial  position:  a  B.  target:  J 


C.  initial  position:  i  D.  target:  d 


Figure 7. Example high- and low-ID tasks for subject M2. The top row, (a)-(b), represents
one of the highest-ID tasks, while the bottom row, (c)-(d), represents one of the lowest.
Images were reconstructed from the L articulatory features in Z (see text).
Figure 8. Movement time (MT) vs. index of difficulty (ID) for subject M2. All
context-target  tasks  are  shown,  divided  by  syllable  position-based  category  (see  text  for 
details  concerning  categories). 
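
A scatter of MT against ID is conventionally summarized by the linear form of Fitts' law, MT = a + b * ID. The Python sketch below fits that line by ordinary least squares. The function name and the sample values are hypothetical and are not taken from the data shown in the figure.

```python
def fit_fitts_line(ids, mts):
    """Ordinary least-squares fit of MT = a + b * ID; returns (a, b)."""
    n = len(ids)
    if n != len(mts) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_id = sum(ids) / n
    mean_mt = sum(mts) / n
    sxx = sum((x - mean_id) ** 2 for x in ids)
    sxy = sum((x - mean_id) * (y - mean_mt) for x, y in zip(ids, mts))
    slope = sxy / sxx                     # b: seconds per bit of difficulty
    intercept = mean_mt - slope * mean_id # a: movement time at ID = 0
    return intercept, slope

# Made-up values lying exactly on MT = 0.10 + 0.05 * ID.
ids = [1.0, 2.0, 3.0, 4.0]
mts = [0.10 + 0.05 * x for x in ids]
a, b = fit_fitts_line(ids, mts)
```

Fitting each syllable position-based category separately would give one intercept and slope per category, which is one way to compare how strongly each task type follows the law.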


